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Abstract. This work is a re-examination of the sparse Bayesian learning (SBL) 


of linear regression models of Tipping (20011 in a high-dimensional setting. We 


propose a hard-thresholded version of the SBL estimator that achieves, for orthog¬ 
onal design matrices, the non-asymptotic estimation error rate of a^s log p/y/n, 
where n is the sample size, p the number of regressors, a is the regression model 
standard deviation, and s the number of non-zero regression coefficients. We also 
establish that with high-probability the estimator identifies the non-zero regression 
coefficients. In our simulations we found that sparse Bayesian learning regression 


performs better than lasso (Tibshirani (19961) when the signal to be recovered is 
strong. 


1. Introduction 


High-dimensional variable selection has become an important topic in modern sta¬ 


tistics. The least absolute shrinkage and selection operator (lasso) of Tibshirani (1996) 
is probably the most widely used method for this problem and has span an extensive 


literature (see e.g. the monograph Biihlmann and van de Geer (2011)). Despite its 
success, the method has many shortcomings. For instance choosing the right amount 
of regularization remains a difficult and computer-intensive issue for many models. In 
parallel to the frequentist approach, Bayesian variable selection for high-dimensional 


problems has also generated a large literature (see for instance O’Hara and Sillanpaa 


(2009) and the reference therein). But most Bayesian variable selection methods often 
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lead to intractable posterior distributions that require a heavy use of Markov Chain 
Monte Carlo simulation. Between these two well-established frameworks lies an em¬ 


pirical Bayes alternative known as sparse Bayesian learning (SBL, Tipping (2001)), 
which has received much less attention in the statistical literature. 

This paper is a re-examination of the SBL for linear regression in a high-dimensional 
setting. An interesting question is whether the SBL procedure recovers the sparsity 
structure of underlying signals. This problem was considered by Wipf and Rao[ (2004) 
which establishes that in the noiseless setting the SBL indeed recovers the sparsity 
structure of the regression coefficients. However the method behaves differently in a 
noisy setting. For orthogonal design matrices, we show that the SBL indeed produces 
a sparse solution of the regression coefficients, but does not in general recover the 
sparsity structure of the regression coefficients. To remedy this limitation we propose 
a hard-thresholded version of the SBL estimator. We show that with high probability 
the thresholded estimator achieves the same estimation error of 0{a\Js log(p)/n) as 
lasso, where n is the sample size, a is the regression model standard deviation, p 
the number of regressors and s the number of non-zero regression coefficients. Fur¬ 
thermore we show that with high probability this thresholded estimator recovers the 
sparsity structure of the regression coefficients provided that the signal is not too 
weak. 

Finally we did a simulation study comparing SBL and lasso. We find that the per¬ 
formance of SBL depends on the strength of the signal (defined here as the minimum 
of the absolute value of the non-zero coefficients). With a weak signal SBL performs 
poorly compared to lasso, but outperforms lasso when the signal is strong. 

The paper is organized as follows. We introduce the SBL method at the begin¬ 
ning of Section We study the computation and the sparsity structure of the SBL 
estimator in Sections 12.1112.31 The hard-thresholded estimator is defined and studied 


in Section 2.2 The simulation study is reported in Section 2.4 and all the techni¬ 
cal proofs are grouped in Section]^ We end the paper with some open problems in 
Section [3l 


2. Sparse Bayesian learning of linear regression models 

Suppose that we observe a vector y G that is a realization of a random variable 
Y such that 


y = A/3* + e. 


(1) 
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for a known and non-random design matrix X G a vector /3* G and a 

random error term e G M"' such that 


E(e) = 0, and E(ee^) = cr^In, 


( 2 ) 


for > 0, where is the n-dimensional identity matrix. Our objective is to estimate 
/3* and u*. Although (l][2) does not make any specific distributional assumption on Y, 
we will consider the following possibly misspecified model: Y N{X(3, a'^ In) , with 
parameter G x (0, oo), where N(//,S) denotes the Gaussian distribution 

with mean /i and covariance matrix S. The parameter cr^ is taken as fixed, and we 
assign to /3 a prior distribution of the form 


%(d/3) = 

i=i 


(3) 


def 

for a (hyper)-parameter 7 = ( 71 ,, 7 p) G 0 = [0, 00)^, where for a > 0, pa denotes 
the distribution of N(0, a), the Gaussian distribution on M with mean 0 and variance 
a, and po{du) (5o(du) denotes the Dirac measure at 0. The posterior distribution 
of (3 given Y = y and given the hyper-parameter ( 7 , cr^) is therefore 

7 r„(d/ 3 |y,o-^ 7 ) oc exp ||y - A/3|p^ 7 r^(d/ 3 ). (4) 

Sampling from the posterior distribution 7 rn,(-|y, cj^, 7 ) is straightforward. Indeed, 
for 7 = ( 71 ,... , 7 p) G 0, denote \^ '= {1 < j < p '■ / 0} the sparsity structure 

defined by 7 . Notice that for j ^ l.y (that is = 0), iTy puts probability mass 1 
on the event {/3j = 0}, and so does the posterior distribution 7 r„(-|y, cr^, 7 ). Hence 
7 r„(-|y, fj^, 7 ) is the distribution of the random variable (Hi,..., Bp) obtained by sim¬ 
ulating {Bj,j G l.y} from N(//.y,cr^Vly), and by setting the remaining components to 
0, where 

p ^ = V ^ X '^ y , = { X !^ X ^ + a ^ r - y \ (5) 

where X^ is the matrix obtained from X by removing the columns j for which 7 ^ = 0, 
and r.y is the diagonal matrix with diagonal elements given by { 7 ^, j G 1^}. With this 
Gaussian linear model, and prior Q, it is easy to check that the marginal distribution 
of y is N(0, C^), where 

^ def 2tt , / 

Cy - CT 1^2 + y ^ 'Jj Xj Xj , 
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and Xj is the j-th column of X. Therefore, up to a normalizing constant that we 
ignore, the log-likelihood of (cr^, 7 ) is given by 


-7logdet(C^) - ^Tr {C^^yy') 


The sparse Bayesian learning (SBL) estimator of /3* as proposed by Tipping (2001); 


Faul and Tipping (2002) is the empirical Bayes estimator of /3 given by 


/3 


n 


j /?7r„(d/3|?/,(T^,7„), 


( 6 ) 


where 

= Argmax(^2^^)gR^xe (7) 

Notice that /3„ is straightforward to compute once and 7 ^ are available. Indeed 
given and 7 „, jSnj = 0 for all j such that ■jnj = 0 , and for the other components 
j G 1.^^, we have from ([^ that 


Remark 1. The presentation of the SBL given above is slightly different from the 


original presentation of Tipping (2001); Faul and Tipping (2002). The key difference 
here is that in the prior distribution tt^ we allow the components of 7 to take the value 
zero. This is needed for the estimator 7 ^ to be well-defined, and for the well-posedness 
of the question of whether the procedure produces sparse solutions. 

□ 


Computationally, the optimization problem ([^ is not a “nice” problem because 
the objective function is non-concave and typically attains its maximum at 

the boundary of the domain 0 (that is some of the components of its solution(s) are 
exactly zeros). We return to the issue of solving ([^ in Section 2.3 But statistically 
([^ is interesting as it yields a sparse solution 7 ^ as we shall see. 


2.1. Existence of 7 ^. Since the log-likelihood function £ is not concave in general, it 
is not immediately clear that the optimization problem ([^ has a solution. It is even 
less clear whether the solution is sparse. Focusing on the case where is assumed 
known, we show that a solution always exists. 

Proposition 2. Fix y G X G and = a^. Then the maximization 

problem Argmax^^Q£{'y,a‘^) has at least one solution 7 = ( 71 ,..., 7 p) which has the 
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following property: 


f 


if (x'f^yy>x'f.\j 


u = ] 

{x'f-^XjY 

( 8 ) 

1 

0 

otherwise , 


where Cj is given by 

Cj aHn + 

^ %XkxY 





Proof. See Section 4.1 


□ 


It is important to notice that there is no randomness involved in the above result: y 
and X are given and fixed. In particular we do not assume 0 nor Q . It is clear that 
this result does not give the expression of the maximizer since the right-hand side of 
Q also depends on 7. Rather it gives coherence relationships between components 
of the solution. But more importantly the proposition shows that the optimization 
problem Q leads to sparse solutions 7. One can interpret the term x'^Cj^y as a 
measure of correlation between the y and the j-th column Xj of X. Hence the result 
shows that if the correlation between Xj and y is sufficiently weak then (and 
hence ^n,j) is set exactly equal to zero. Of course Proposition is useful only to the 
extent that the inequality < x'^Cj^Xj is satisfied with high probability 

when = 0. We investigate this below. Unfortunately we will see that in general 
7 does not recover exactly the sparsity structure of even in the most favorable 
setting. We make the following distributional assumption. 


HI. The data generating model 0{^ holds and e ~ N{0, a'^In), for some a* > 0. 


We shall also focus our analysis on the idealized case where the matrix X has 
orthogonal columns. 

H2. The design matrix X G is such that {xk,Xj) = 0 whenever j 7^ k. 


Proposition 3. Suppose that 411 hold, and Then for any j G {I,... ,p} 

such that = 0, 


where Z ~ N{0, 1). 


Proof. See Section 4.2 


□ 


P [%,j = 0] = P < 1] 0.68, 
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The result above shows that even in the idealized setting of I® and under the 
Gaussian linear model assumption, the SBL procedure will set 7j to 0 (for j ^ I) only 
about 70% of the time, regardless of the sample size. We do not know whether this 
result continue to hold for more general design matrices. The behavior of the solution 
of 0 for a general design matrix is technically more challenging. 

Another important limitation of the SBL procedure is the computation of jn and 
(7^. Typically iterative methods (such as the EM algorithm, see Section 2.3) are used. 
The EM algorithm does not promote sparsity, and converges to the solution only at 
the limit. Therefore, in finite time, the solutions generated by the EM algorithm are 
typically not sparse at all. 

These two shortcomings limit the usefulness of the basic SBL procedure as an 
interesting method for sparse signal recovery. However, we observe that when /3*j' = 0, 
and the condition (x'^Cj} t') < x',C~} Xj fails, assuming again the most favorable 

\ J Jiln J J Jnn ^ 

setting of 10 7j is given 


— 


- 1) 
{Xj,Xj) 


|- 4 ^ 


where Zj ~ N(0,1). Hence 7j has mean zero and variance of order 0(||xj 

We conclude that when SBL fails to set to zero a component j such that 
/3*7 = 0, the computed SBL solution jj is typically very small. This suggests that a 
thresholded version of 7„ should be able to set these terms to zero. We pursue this 


approach in Section 2.2 


2.2. A thresholded version and its statistical properties. We saw in Section 


2.1 that although sparse, 7„ does not recover in general the sparsity structure of /3* 


To improve on this we propose a modified, hard-thresholded version of % denoted 7„ 
and defined as follows. For 1 < j < p, 


def 

— 


Jn,j if ln,j > 


otherwise 


(9) 


for a thresholding parameter z* that we set to z* = c(l + |p|) logp, for a constant c, 
and where p is an estimate of the largest correlation among the columns of X. The 
corresponding modified estimator of /3* is 


/3n =^y /37r„(d/3|y,d^,7n). 
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Theorem 4. Assume and suppose that is known, logs > and z* = 

Co logp, for some constant cq > 2, where s = |/-y^|. Then 

,CT^S log (p) 


-/3*||2 <Af- 


n 


with probability at least l~ p(eo^s)/8 ~ exp(s) ^ where M = 


4(2+co) 


Proof. See Section 4.3 


( 10 ) 

, anrf c = mini<j<p IIXjIp/n. 

□ 


We deduce the following corollary. For u G sign(?x) = (si,...,Sp) where for 
each i, Si = 0 if Uj = 0, Sj = 1 is Ui > 0, and s* = — 1 if rtj < 0. 


Corollary 5. In addition to the assumptions of Theorem\^ suppose that 

I Ma^slogp 


min |/3*j| > 


n 


( 11 ) 


Then with probability at least 1- qp - Ipp-cp —, sign{/3n) = signijlA. 


Proof. See Section 4.4 


□ 


2.3. Computing and 7„. Here we address the issue of solving Q. Because the 
function is not concave, and typically attains its maximum at the boundary of 

the domain 0, the optimization Q is not a smooth problem. The strategy originally 
developed by Tipping (2001) focuses instead on the smooth problem obtained by 
maximizing i over the open domain where M+ (0,oo). That is, find 

Argmax(^2 ,^)gjjp+i £(cTp 7). (12) 

Of course, this latter problem has no solution whenever the solution of ([^ occurs at 
the boundary of 0. Nevertheless, we will see that an EM algorithm that attempts to 
solve (12) produces sequences that converge to the solution of ([^. 

Since the likelihood function exp(l') of (cr^,7) is obtained by integrating out /?, we 


can treat /3 as a missing variable and use the EM algorithm as proposed by Tipping 
(2001). Eor 7 G M^, the so-called complete log-likelihood takes the form 

p / 


4om(/3w^7|y) = + 


1 


j=i \ 
= (0,cx))^'''^, set 


7i 


2cj2"" 2 

Given a working solution ({cj^}^^P 7^*^p G 

Q(cjp7|{cj2}(4,7(0) f 4o,^(/l,fT^,7|y)7r„(d/3|y,{o-2}(4,^(0) 
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the so-called Q-function. We will use the upper-script (k) to index sequences gen¬ 
erated by the EM algorithm. Set = (XW-|-and = 

V^^'^X'y, where = diag( 7 j^\ ... , 7 p^^). Maximizing Q{-\{cr'^}^^\'y^^^) is easy and 


gives 


{^ 2 }(fc+l) ^ 1 

n 


= n-^ (\\y- Xy^'^'> f + {a^}, 

and for j = 1,... ,p, 

This leads to the following algorithm for solving @ 

Algorithm 1 (EM algorithm). Given 7*-^^) G = (0, oo)^"'“^, we com¬ 

pose the matrix E^^^ = diag( 7 |^\ ... ,7p*^^). 

(1) Compute = {X'X + {u2}W{e(^)}-i)“\ and = V^^^X'y. 

(2) Set 

{A)(*+'> = i (lls - + {(T^}<*>Tr(rWYX)) . 


Although this EM algorithm is designed to solve the maximization problem (12) 
we will see that it typically converges to the solution of ([^. To simplify the analysis 
we assume again that holds and that cr^ is fixed. Hence we focus only on the 
recursion in 7 : 

(fc+l) r {k)^2 I 2Tr(k) ■ 

'j -iy) } + ^ ’ J = 

With the assumption that the design matrix is orthogonal, we can work out explicitly 
the terms = (A'X -|- c 7 ^{E(^)}“^) ^ and = V^^'>X'y, which leads to 


(fc+i) 


{xj,yy 


+ (fe) 
> i 




a 




j = 1,... ,p. 


(13) 


77fe) 


Proposition 6 . Fix y G M"', and X G such that iM holds. Fix cr^ > 0. Let 


k > 0} denote the sequence produced by the recursion (13) for some initial 7 ^°^ 
with positive components. Then for all j G {1,... ,p}, 


lim = Jnj = 

k^oo 


‘''"‘yfX"" .f {y.Xjfxr^W^ir- 

0 otherwise 
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Proof. See Section 4.5 


□ 


Remark 7. In the non-orthogonal design setting, our simulation results suggest that 
the conclusion of Proposition continues to hold, although we do not have any rig¬ 
orous proof. 


2.4. A simulation study. 


2.4.1. Synthetic Data Sets. We investigate by simulation the behavior of the SBL 
procedure and its thresholding version, and how they compare with lasso. For all the 
simulations, n = 100 and p = 500. We generate the design matrix X by simulating 
each row independently from the Gaussian distribution N(0, S) where Sji = 1 and 
Tiij = p for i ^ j. We consider two values of p: p = 0 for which X is close to satisfy 10 
and p = 0.9 which produces a design matrix X with strongly correlated variables. We 
simulate the dependent variable Y from the N(X/1*, with cr* = 1. We consider 

four (4) different scenarios of sparsity, with s = 3,15,25, and s = 50 where s is the 
number of non-zero elements of /3*. The magnitude of the non-zero elements also play 
an important role in the recovery. We generate all the non-zeros components of /?* 
from the uniform distribution U(a, a -|- 1), for a ranging from 0 to 9. 

For each value of p, each sparsity level, and each signal strength a, we repeat 
each estimator 30 times, and we compute the relative error rate (||/3 — /3*||/||/3*||), the 
sensitivity and the specificity, averaged over these 30 replications. The sensitivity 
(SEN) and the specificity (SPE) of a given estimator f3 are defined as 


SEN0) 


Ei=l 


and SPE(/3) 


{/ 3 ,¥ 0 } 


These measures are valid for any estimator /3, and we compute them for the thresh- 
olded version of SBL, the non-thresholded version of SBL, as well as for the lasso es¬ 
timator. For the thresholded SBL, we use z* = c(l -|- |/5|) logp, where c is determined 
by minimizing the BIG: -|- slog(n). 

We compute the lasso estimator using the function cv.glmnet of the package GLM- 


Net (Friedman et al. (2010)) where we select the penalty term A by a 10-fold cross- 
validation procedure. In the cross-validation, the regulation parameter selected min¬ 
imizes the prediction error. 

The simulation results are presented on Figure 1-8. As one can see from these 
figures, the main conclusion is that SBL is more sensitive than lasso to the strength 
of the signal (defined here as as the parameter a). With a weak signal it performs 
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poorly, but outperforms lasso when the signal is strong enough. Another interesting 
finding is that, overall, lasso performs poorly in selecting the non-zeros components 
(variable selection). This is consistent with recent results (Meinhausen and Yu (2009)) 
which shows that variable selection consistency of lasso requires the irrepresentable 
condition, which actually is a very strong condition that often does not hold in prac¬ 
tice. For instance, the irrepresentable condition fails for all the design matrices of 
this simulation study, except for the design matrix behind Figure 2. 

2.4.2. A simulated real data example. In this example, we consider a micro-array data 
concerning genes involved in the production of riboflavin. The data is made publicly 
available at 

http://WWW.annualreviews.org/doi/suppl 

/10.1146/annurev-statistics-022513-115545 

and contains n = 71 samples and p = 4088 covariates corresponding to 4088 genes. 
Each of the sample contains a real valued response consisting of the logarithm of the 
riboflavin production rate and 4088 real valued covariates consisting of the logarithm 
of the genes’ expression levels. 

Given the very high dimensionality of this dataset, the lack of any true value of 
the parameter, and given also the fact that micro-array data are well-known to be 
very noisy, direct comparison of different regression methods on such dataset cannot 
be very insightful. For a more meaningful comparison, we use the riboflavin design 
matrix X G ]^7ix4088 generate simulated levels of riboflavin production rate using 
the sparse regression model Y = Xj3 + e where e rsj ^"(0, (T^l 7 i), with = 1. The 
magnitude of the non-zero components of /? are uniformly simulated jdj ~ U{a,a + 1) 
with a = {0,1,... ,9}. We set the number of non-zeros elements in the vector /3 to 
5. Figure 9 shows the results of the simulation evaluated using the aforementioned 
metrics. Under such extreme high-dimensional conditions, both methods perform 
poorly. SBL has found all the relevant variables but has also selected many non- 
relevant variables. Lasso has produced more sparse solutions, but has missed some 
important variables. The results remain essentially the same even when we set 
(the variance of the noise e) to 0.1. 

One final word on computing times. We compute the SBL estimate using Algorithm 

and we use the package GLMNet to compute lasso. We implemented Algorithm 
in R. The core of the GLMNet package is written in Fortran and the result is very fast. 


(Meinhausen and Yu (2009 
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The comparison of the computing times is largely in favor of GLMNet. Comparing 
computing times is always tricky as it depends to a large extent on the programming 
language and skills. But beyond the implementation differences, it seems clear that 
lasso has a computational advantage over SBL in that it leads to “easier” (convex) 
optimization problems, compared to SBL. 


3. Conclusion 


We have shown that when the design matrix is orthogonal, the SBL estimator is 
uniquely defined, sparse (however does not recover the true sparsity structure of the 
signal), and can be computed using the EM algorithm. We have also proposed a hard- 
thresholded version of SBL, and shown that the hard-thresholded estimator recovers 
the true sparsity structure of the model, and achieves the same estimation error bound 
as lasso (with high probability). Furthermore our simulation results show that the 
method compares very well with lasso, and outperforms lasso when the regression 
coefficients are not too small. 

One important and pressing issue is the extension of these results to non-orthogonal 
design matrices. In particular we wish to understand the type of design matrix X 
for which these results continue to hold. This SBL theory and its comparison with 


the recently developed lasso theory (see for instance Meinhausen and Yu (2009); 


Bickel et al. (2009)) could potentially give new insight into high-dimensional regression 


analysis. The generalized singular value decomposition (see e.g. Golub and Van Loan 


( 2013[ )) of and T.^ seems to be a promising approach to tackle this problem. 
The challenge in this approach appears to be the development of an appropriate 
differentiability theory for the components of the GSVD decomposition as a function 
of 7 . 

The SBL method can be extended in several directions. It can be easily extended 
to deal with generalized linear models, and graphical models. But in these exten¬ 
sions, the computation of the estimator might require some new algorithms. Another 
possible extension of the method would be to replace the Gaussian distribution in the 
prior by some other distribution. Some of these extensions of the methodology are 


already being explored. For instance Balakrishnan and Madigan (2010) replaced the 


Gaussian distribution by the double-exponential distribution and shows by simulation 
that the resulting estimator (called demi-lasso) compares very well with lasso and the 
standard SBL. 
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Relative Error rate for rho = 0 


Sensitivity for rho = 0 


Specificity for rho = 0 





a 


a 


a 


figure 1: Sensitivity, specificity and relative error for SBL and lasso as function of 


a. s = 3. 


Relative Error rate for rho = 0 


Sensitivity for rho = 0 


Specificity for rho = 0 





a 


a 


a 


Figure 2: Sensitivity, specificity and relative error for SBL and lasso as function of 

a. s = 15. 
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Relative Error rate for rho = 0 


Sensitivity for rho = 0 


Specificity for rho = 0 





a 


a 


a 


figure 3: Sensitivity, specificity and relative error for SBL and lasso as function of 


a. s = 25. 


Relative Error rate for rho = 0 


Sensitivity for rho = 0 


Specificity for rho = 0 





a 


a 


a 


Figure 4; Sensitivity, specificity and relative error for SBL and lasso as function of 

a. s = 50. 
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Relative Error rate for rho = 0.9 


Sensitivity for rho = 0.9 


Specificity for rho = 0.9 





a 


a 


a 


figure 5: Sensitivity, specificity and relative error for SBL and lasso as function of 


a. s = 3. 


Relative Error rate for rho = 0.9 


Sensitivity for rho = 0.9 


Specificity for rho = 0.9 





a 


a 


a 


Figure 6: Sensitivity, specificity and relative error for SBL and lasso as function of 

a. s = 15. 
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Relative Error rate for rho = 0.9 


Sensitivity for rho = 0.9 


Specificity for rho = 0.9 





a 


a 


a 


figure 7: Sensitivity, specificity and relative error for SBL and lasso as function of 


a. s = 25. 


Relative Error rate for rho = 0.9 


Sensitivity for rho = 0.9 


Specificity for rho = 0.9 





a 


a 


a 


Figure 8: Sensitivity, specificity and relative error for SBL and lasso as function of 

a. s = 50. 
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Relative Error rate 


Sensitivity 
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Figure 9: Sensitivity, specificity and relative error for SBL(thresholded) and lasso as 
function of a. s = 5. The results of this simulation is generated using the riboflavin 

data 


4. Proofs 

4.1. Proof of Proposition!^ 

Proof. For any i £ {1,... ,p}, 

^( 7 ) - logdet(C.^) < logdet {aHn + JiXix'i) I - 00 , 

as 7 i ^ 00 . This together with the continuity of 7 1 —)■ £( 7 ) imply the existence 
of a maximizer. For any such maximizer 7 , consider the vector 7 such that the j- 
th component of 7 is free to vary and the remaining components 7 _j are fixed to 
7 _j. Then we write = Cj + 'jjXjXj and use the matrix identity (^4 + uu')~^ = 
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A-1- 


l-\-u'A ^ 


to deduce that 


c-i = cr' - 


^ 1 + "ijx'-C-^Xj 


Therefore, 


1 ^ 1 ^y) 

^(t) = “2 logdet {Cj + -fjXjx'j) + - - 


+ -fjx'-C- Xj 2 


Since Cj does not depend on 'jj, we easily see that jj i—>■ £(7) is differentiable on 
(0, 00) and 

2 / , \ 2 


1 ,„-i 

= --XjC^ Xj + - 


1 

+ A- 




^ (1 + 7ix'cr'x.y 

If ^^£(7) < 0 and 7j 1—>■ £(7) attains its maximum at 0 . Simi¬ 
larly if > x'jCj^Xj, it is easy to check that 7j i-A £(7) attains its maximum 

at {{^^'jCj^y^ — x'jCj^Xj)! (^x'jCj^x^ . Now if 7^ differs from the maximizer just 
found, we can improve on the likelihood by setting 7^ equal to that maximizer, which 
would be a contradiction. Hence the result. 

□ 


4 . 2 . Proof of Proposition!^ 

Proof. Recall that I = {1 < j < p : /3*j / 0} is the sparsity structure of /3*. For 

def def 

7 G 0, and 1 < j < p, we define Iq = I n l.y \ {j}, and b = Fn l.y \ {j}, where in order 
to keep the notation easy, we omit the dependence of Iq and b on (7, j). We will also 
write X\g (resp. X|^) to denote the matrix obtained by collecting the columns of X 
whose indexes belong to Iq (resp. b)- We define 

= a In + 2^ IkXkXf, 

ke\^\{j} 

= aHn + ^ JkXkx'k + ^ IkXkx'k- 
fcelo fceii 
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By the Woodbury matrix identity and the assumption = 0, we get: 

- ^^1. (fi;'+ " y„ - "x;.. 

Hence, for k £ \, and using the fact that j ^ I, we have 


3 3,1 

Therefore, if T = X/3* + e, we get 


x'^C^^Xk = 0, and x^C^J^Xj = ■^\\xj\\'^. 


^' 3 ^ 3 , ^ ^^3X) ~ N {0,a‘^\\xjf) . 
Now, the matrix Cj defined in Propositionis Cj = Hence 




= P [Z^ < 1] 


T'J — 0] — P 

where Z ~ N(0,1). Hence the result. 

4.3. Proof of Theorem [d]. 

Proof. Under XjCj}^^Y = {xj,Y) and xjCjj^^xj = ||xj|p/c7^. Hence 

{xj,Y)‘^-P\\xj\ 


7i = 


if (Xj, y)^ > (T^||xj|p(l + z*) 
otherwise. 


Similarly, under HI ^n,j has the explicit form ^n,j = —It follows that 

, if {Y,Xj)'^ > cr'^\\xj\\'^{l +z^) 


/3n,j — 


rn,j 


0 


otherwise . 


□ 


Again using the orthogonality assumption of X, we obtain {Y,Xj) = /3*j||xj|p + 

Hpf 

{e,Xj). We set tj = {e,Xj). Then it follows that 


Pn,j P*,] — ^ 


0 


'*,3 




'^rij 


if /3*j = 0, and 

(^)> 

11^^.112(1 + 2;*) 

if /3*j = 0, and I 

(^) >, 

i4(i+«) 

if /3*j' / 0 and ^ 

¥J¥ + 


if /3*j' / 0 and ^ 

W + 
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Suppose that j G I = and ^ y < ||^,|p (1 + 2 ;*). Then with Zj 




Hence for such index j, 


..112 


+ — T1-[T + \/l + ^*) 


a 


r,2 

1 < 


\Xi 


2 \ J 


(14) 


But for j G 1^^, such that > ||^(1 + %,j = 

2 

^-T^. Using this with some easy algebra, we obtain that for such index j, 




Ir •l|2 -I- ^ 

^ ^ri,j 


2 l^*d II™.112 


a 


xj\r tj+ \\xj\ri3i.,j 


(15) 


It follows that 


|xj|p/3*j + tj 




f^*d 


< 


a 


\Xi 


{l + \Zj\). 


With (14) and (16), we get 


(16) 


E 


nd - P*d) - ~ X] )’ 

fGb* 


where c = mini<j<p ||xj|p/n. Set s By Teicher (1984) Lemma 5, IE(|Z? — 

ll*') < k > 2. Hence by Bernstein’s inequality (see e.g. van der Vaart and 


Wellner (1996) Lemma 2.2.11), we conclude that 


(1 + 2* + Zj) > 2(1 + z^)s 

< P 

Y1 - 1) > 



_i6L* 


< exp 


z'^s'^ \ / z*s\ 1 


Hence with probability at least 1 


1 

pCQs/S ) 





4it^ 

-(1 + z*) s < 

cn 


4(1 + Co) fJ^slogp 
c n 


( 17 ) 
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On the other hand, from (|15|), - - — hence 


E 


j - A.j) = E 


cr 


, Z?>l+2* 




x,||2 V"-’ Zj - cn ^ 


Ez|i| 


- ^E^h{z|>i«.)' 

i=i 


Now for any k G (0,1/2), a > 0, and by Markov’s inequality 



p 


/ p \ 

E 


= E 

exp Y,^^]hzpi+z.} > 6“'" 


_i=i 


L v^=i / J 


< exp 


—an + plogE 


exp UZ^l^zl>i+z.}} -( 18 ) 


We calculate that 


E 


exp iKZ^l^zf>i+z,} 


= 2 


f\/r+W g-a;2/2 


< 1 + 2 


/: 


e~2 


dx + 2 


L 


oo g—i(l—2 k)x^ 


oo —^( 1 — 2 k ) x ^ 


< 1 + 


exp 


+1+51 

z.{1-2k) 


vT+il i/27r 
dx 


dx 


1 - 2k 

where the last inequality uses some easy algebra and the well known bound on the 
Gaussian cdf: t > 0. With z. = cologp. 

We deduce that 

/ eo(l-2K) \ 


plogE 


exp UZ^l^z^^-^+z.} 


< p log 1 + 


P 


< 


P 


cqk, 


(1 - 2k) - 1-2k 


Hence with a = ^ f_ 2 ^ < 4k (18) gives 


i=i 


< exp —OK + 


p 


cqk 


1 - 2k 


< exp (-p"°''). 


We conclude that with probability at least 1 — exp(—p'’°'^), Yl^j=i < a < 

4k“^P'’°'‘, so that 

1 2 4(t2 


E 

301* 


i,j - P.,j ) <—K 


(19) 
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with probability at least 1 — exp(—Combining (© and (|19|) it follows that 


-Ml< 




cn 


(1 + co)slogp + 


P 


CqK 


with probability at least 1 — — exp(—Finally since logs > 1, we can take 

K = log(s)/(co log(p)) G (0,1/2) to achieve s = With this choice, 

pCQK, slog(p) ^ slog(p) 


K Co log(s) 


Co 


and the theorem follows easily. 


□ 


4.4. Proof of Corollary 

Proof. Recall that I = I 7 * = {1 < j < p : 7 ^ 0}. We write = (uj, j G l^^), 

and = {uj, j ^ 1 ^^). It is clear that whenever ( |ll| holds and \Pn,j — f3*,j\ < 
y/Ma‘^slog{p)/n, we have sign(/3„j) = sign(/3*j). Since \^n,j - P*,j\ < ll^n - , 0 *|| 2 , we 
conclude that sign(/3„,i^J = sign(^*j^J, with probability at least 1 - 

For the other part, it follows from the definition of /?„ that for /3*j = 0, sign(/3,ij) / 
0 implies that > 1 + z*. Hence 

P (sign/ sign^ P (Z| > 1 + z*) = ^ 2P {Zj > y/l + z*) 


i=i 

p 


i=i 


< 2 ^+^*) < exp (logp — — logp 

i=i 

1 


— £0 ' 
p 2 


The results follows. 


□ 


4.5. Proof of Proposition!^ 

Proof. We fix an arbitrary j G {1,... ,p}. We define Xk = ||a^jwhere we omit 


the dependence on j to keep the notation simple. It follows from (13) that 

2 

/ . 7 : U \ 

Xk+i = B 


Xk 


+ Xk 


<X^Xk .r / N 

+ -= ^{Xk), 


+ Xk 


where B = (y,/||xj|p, and 


T(x) = B 


+ X 


+ 


a^x 

+ X 
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Notice that for all x > 0, G [0,+ B]. Hence the sequence {xk, A: > 0} is 

bounded. The equation 'I'(x) = x is equivalent to x^(x + = x^B. If B < 

'I'(x) = X has a unique solution x = 0. If H > cr^, then T(x) = x has two solutions 
X = 0 and x = B — a^. The derivatives of T are given by 

, 2xa^B „ —2a‘^x(a‘^ + 2B) + 2a‘^{B — a‘^) 

^ i^) = -r^ -vJ + ^ (®) =-^ ^ 9 ^ 

(cj2 + x)2 (cr^ + x)3 (cr^ + x)^ 

We consider two cases 

(1) Case 1: H < cr^. Then 'h"(x) < 0 for all x > 0. Hence T is concave. This 
implies that for all x > 0, 

T(x) < T(0) + T'(0)x = X. 

This implies that x^ = 'I'(xfc_i) < x^-i- This means that the sequence 
{xfc, k > 0} is bounded and non-increasing, hence has a limit x*. By conti¬ 
nuity of T, the limit point x* satisfies T(x*) = x*. Hence x* = 0, since we 
have seen above that 0 is the only fixed-point of T when B < . 

(2) Case 2: B > Then T"(0) > 0, and by Taylor expansion, in a neighbor¬ 

hood of 0, we have T(x) > T(0) -|- T'(0)x = x for all x > 0 small enough. If 
X* = Bj — cr^ denotes the unique positive fixed point of 'h, we can conclude 
that for all x G [0,x*], T(x) > x, and for x > x*, 'I'(x) < x. Therefore, if 
xo G [0, X*], then {xfc, /c > 0} is increasing and bounded, hence converges to 
the unique positive fixed point x* (recall that xq > 0). Whereas, if xq > x*, 
then {xfc, /c > 0} is decreasing and bounded, hence converges to the unique 
positive fixed point x*. 

□ 
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